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Abstract 

To detect rare variants associated with a phenotype, we develop a novel statistical method that can use both 
family and unrelated case-control data. Unlike the currently existing methods, we first use family data to calculate 
weights to be given to rare variants, differentiating between concordantly affected and discordant sib pairs. These 
weights are then used in an association test applied to the unrelated case-control data. We applied the proposed 
method to the simulated sequencing data in Genetic Analysis Workshop 17 and identified two genes associated 
with the disease. 
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Background 

Genome-wide association studies, which are based on 
the common disease/common variants assumption, have 
successfully identified susceptibility loci for complex 
traits. However, the variants discovered through these 
studies explain only a modest portion of the trait varia- 
bility [1]. With the new technological advances, it has 
been suggested that it is time to shift the search from 
common variants of modest effect to rarer variants of 
large effect by effectively searching the full genome [2]. 
Rare variants may hold promise to predict individual 
risk for personalized medicine because of their large 
effect, although it has been argued that common var- 
iants illuminate the biologic pathways that underlie dis- 
eases [3]. 

Bodmer and Tomlinson [4] suggested that a set of 
low-frequency variants from different genes can account 
for a significant proportion of the variability of relatively 
common diseases. To achieve reasonable statistical 
power, it is critical to define the rare variants and test 
them collectively. The existing statistical methods in the 
literature mainly collapse rare variants [5]. Madsen and 
Browning [6] proposed using the inverse of the variance 
of the minor allele frequency (MAF) in control subjects 
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as a weight and then collapsing the weighted rare 
variants. 

Briefly, for the ith individual Madsen and Browning 
[6] define a genetic score: 

L 

X i=^ w )Zij> (i) 

where L is the number of variants, g t j is the genoty- 
pic score, and Wj is the weight for the ;th single- 
nucleotide polymorphism (SNP). The weight Wj is 
defined as the inverse of the ;th SNP's standard devia- 
tion estimated in control subjects when the corre- 
sponding MAF is less than a (such as 0.02) and 0 
otherwise. Then the Wilcoxon rank sum test is applied 
to do the association test. Madsen and Browning rank 
the genetic scores, calculate the sum of the ranks for 
case subjects as: 

X= X rank ( X ')' (2) 

ie cases 

and calculate the ^-value using a permutation strategy. 
That is, they permute disease status among individuals 
1,000 times to compute 1,000 statistics X, denoted x[ > 
X* 2 > ^1000 • Then they use the sample mean ju and 
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the standard deviation £ of x[, X* 2 > • ^1000 to 
calculate the test statistic: 



where Af sib is the number of sib pairs. For the ;th SNP, 
define: 



Z = 



X-ft 



(3) 



which follows approximately a standard normal distri- 
bution under the null hypothesis. Madsen and Browning 
[6] demonstrated that this weighted-sum method is 
more powerful than the collapsing method [5]. 

Recently, we have demonstrated that family data are 
useful for searching for rare variants [7,8], because the 
rare variants can be substantially enriched among segre- 
gating family members. Here, we present a statistical 
method to test rare variants by using both family and 
unrelated case-control sequencing data. 

Methods 

Defining a weight for each SNP using the family data set 

We start by assuming that we have L variants belonging 
to a group (gene, pathway, specific genomic region, 
etc.). We have shown that family data, such as affected 
sib pairs, have enriched information for detecting rare 
disease-associated variants because the same disease var- 
iants are more likely to segregate within a family [7]. 
Thus each family can be considered homogeneous; that 
is, affected family members share the same or allelic dis- 
ease variants, the latter being an idea used in traditional 
linkage analysis. Our method uses either affected sib 
pairs or discordant sib pairs to determine weights for 
such rare variants. 

Assume that a SNP has two alleles A and a, where A 
always refers to the minor allele. For the /th sib pair and 
the yth SNP, we define a genotype score gy . Let a be a 
predefined threshold. We always let gy = 0 if a SNP 
has a MAF greater than a. For those SNPs with MAF < 
a , we define gy as follows. For affected sib pairs, 
gy = 1 if both sibs carry A, gy = 0 if neither sib carries 
A, and gy = K / L if one sib carries A and the other 
does not, where K is the number of other SNPs in the 
region with MAF <a carried by the other sib. For dis- 
cordant sib pairs, gy = 1 if the affected sib carries allele 
A and the unaffected sib does not, gy =0 if the affected 
sib does not carry A, and gy = K / L if both sibs carry 
A, where K is the number of alleles with MAF <a pre- 
sent at the other SNPs carried by the affected sib. 

Let: 



1 J ^sib 



(4) 



1/2 



(5) 



where pj is the MAF of the SNP, estimated from the 
case and control subjects combined. We rank the bj in 
descending order and give the yth SNP weight wj - bj if 
bj falls in the top quartile and weight Wj = 0 if not. The 
proposed weight incorporates information about the 
variants shared (not shared) by affected (discordant) sib 
pairs. 

Performing association tests 

After defining the weights for individual SNPs, using 
either affected or discordant sib pairs, we use the 
weights to test for association between the phenotype 
and each set of variants in a group. We assume that we 
have N D unrelated case subjects and N c unrelated 
control subjects. Let k be the /cth unrelated individual. 
For each SNP / (j = 1, L), we let Wj be the weight as 
found earlier using the sib pairs. We use the Wj to calcu- 
late the /<th individual's genetic score, similar to the idea 
of Madsen and Browning [6]; that is, 



(6) 



where is the number of minor alleles in SNP / for 
individual k. We then define: 



N D L 



Yd 



u fe=i ° fe=i i=i 



(7) 



where the summation is over the unrelated case 
subjects. Similarly, we define: 



N c L 



Yc 



77r£^^££^ 

u fe=l c fe=l 1=1 



(8) 



calculated for the unrelated control subjects. Our null 
hypothesis is that no marker is associated with the 
phenotype; that is, we have E[y D ] = E[y c ], and so we 
define our test statistic for association as: 



Yd~Yc 



( S f D +S fc) 1/2 ' 



(9) 
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where S^ D and S^ c are estimates of the sample var- 
iances var(7 D ) and var(7~ c ) in case and control sub- 
jects, respectively. 

Application to the Genetic Analysis Workshop 17 
sequence data set 

The Genetic Analysis Workshop 17 (GAW17) simulated 
sequence data set includes both family and unrelated 
individuals data, which is ideal for the proposed method. 
Each replicate is composed of 697 individuals in the 
family data set and 697 individuals in the unrelated indi- 
viduals data set. There are 200 replicates in total, all 
with the same genotype data but with phenotype data 
independently simulated across the replicates. 

We first performed an analysis using only one repli- 
cate, which was selected to be the 82nd replicate. To 
calculate the weights, we clustered the sib pairs into 
affected and discordant sib pairs. Based on the 82nd 
phenotype replicate, we identified 38 affected and 22 
discordant sib pairs. We then set a = 0.01 as the MAF 
cutoff to define the rare variants and calculated weights 
based on the identified 38 affected sib pairs using the 
Eq. (4-5). 

The 82nd phenotype replicate includes 209 unrelated 
case subjects and 488 unrelated control subjects. This 
data set was used to test for association in each of 3,205 
genes. 

We compared the power of the proposed method with 
Madsen and Browning's method using the same unre- 
lated case-control data. The same MAF threshold of a = 
0.01 was used to define rare variants for Madsen and 
Browning's method. 

We expect there to be little or no power for detecting 
rare variants using a single replicate because of the 
small sample size. We therefore included additional 
replicates until we reached a sample size of 400 affected 
sib pairs and 2,400 case subjects and 2,400 control 
subjects. We used 400 affected sib pairs, which was 
suggested by our previous study [7]. Although the 
genotypes are the same for the different replicates, the 
genotype-phenotype association is simulated indepen- 
dently for the different replicates. Thus the way we 
increased the sample size will have little impact on our 
power comparison. 

Results 

We applied the proposed method and Madsen and 
Browning's method to 3,205 genes using the 82nd repli- 
cate. As expected, we found virtually no power for 
either method. 

We next increased the sample size to 400 affected sib 
pairs to calculate the weights and an additional 2,000 
case subjects and 2,000 control subjects for the associa- 
tion test using our proposed method. For comparison, 



we used 2,400 case subjects and 2,400 control subjects 
for Madsen and Browning's method. Thus the total 
sample size is the same for both methods. Figure 1 pre- 
sents the result for testing 3,205 genes. The horizontal 
line indicates the 5% significance level after adjusting for 
3,205 tests. After correcting for multiple comparisons, 
we observed 14 and 7 genes reaching significance for 
the proposed and Madsen and Browning's methods, 
respectively; 2 of 14 and 1 of 7 significant genes are real 
causal signals. 

We next examined the 15 causal genes generated in 
the simulations (Table 1). Both methods detected 
PICK3C2B, but our proposed method resulted in a 
much smaller ^-value (8.67 x 1(T 10 vs. 1.48 x 1(T 5 ). 
Furthermore, the proposed method identified 
HSP90AA1 (p = 8.63 x 10~ 6 ), which was missed by 
Madsen and Browning's method. 

Discussion and conclusions 

In this paper, we propose a novel method to analyze the 
GAW17 data set. Unlike the existing methods, the pro- 
posed method calculates the weights using either 
affected or discordant sib pairs. The proposed method 
requires that both the case and control groups and the 
family members are genotyped for the same set of SNPs 
or are resequenced in the same region. Compared with 
Madsen and Browning's method, the proposed method 
detects the true causal gene HSP90AA1, which was 
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Figure 1 Manhattan plot of -log 10 (p-value) for each single 
gene after pooling data to reach a larger sample size The 

upper panel is the result of our proposed method, and the lower 
panel is the result of Madsen and Browning's method. 



Feng et al. BMC Proceedings 201 1, 5(Suppl 9):S80 
http://www.biomedcentral.eom/1753-6561/5/S9/S80 



Page 4 of 4 



Table 1 Causal gene with tested p-value 


Gene 


p-value (Madsen and Browning 
method) 


p-value (our 
method) 


AKT3 


0.1 73 


1 .0 


BCL2L 1 1 


0.0555 


1.0 


ELAVL4 


0.90 


0.0283 


HSP90AA 1 


0.737 


8 63 x 10~ 6 


NRAS 


0.459 


1 .0 


PIK3C2B 


1 .48 x 1 0~ 5 


8.67 x 10~ 10 


PIK3C3 


0.796 


1.0 


PIK3R3 


0.732 


0.236 


PRKCA 


0.564 


1.0 


PRKCB1 


0.684 


0.236 


PTK2 


0.868 


1.0 


PTK2B 


0.433 


1.0 


RRAS 


0.480 


1.0 


SHC1 


0.925 


1.0 


S0S2 


0.465 


1.0 



The causal genes of latent liability disease and their p-values found using 
Madsen and Browning's method and our proposed method. 



missed by Madsen and Browning's method. Our method 
demonstrates that incorporating family data can poten- 
tially improve statistical power to detect rare variants in 
an association analysis, which is consistent with the 
result of Zhu et al [7]. 
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